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Abstract 

We herein propose a new robust estimation method based on random pro- 
jections that is adaptive and, automatically produces a robust estimate, while 
enabling easy computations for high or infinite dimensional data. Under some re- 
stricted contamination models, the procedure is robust and attains full efficiency. 
We tested the method using both simulated and real data. 
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1 Introduction 

In the problem of robust estimation in high dimensional and functional data, a number 
of different issues arise that cause the well known robust multivariate estimators to fail, 
and new ideas are required to address these. 

Before stating our proposal, we first review the origins of robustness in order to 
identify some of the main ideas necessary to guide us in this complex problem. 

Ronchetti (2010) who stated that "the primary goal of robust statistics is the de- 
velopment of procedures which are still reliable and reasonably efficient under small 
deviations from the model, i.e. when the underlying distribution lies in a neighborhood 
of the assumed model. Robust statistics is then an extension of parametric statistics, 
taking into account that parametric models are at best only approximations to reality. 
The field is now some 50 years old. Indeed one can consider Tukey (1960), Huber (1964), 
and Hampel (1968) the fundamental papers which laid the foundations of modern robust 
statistics." 
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The median. For one dimensional data, the oldest robust estimates are the median 
and the a-trimmed means. The median is in several respects the 'most robust' consistent 
possible location estimate, and great efforts have been made to extend the median and 
the trimmed means to higher dimensions. The extension of these two simple notions to 
multivariate data took place via the two main avenues of impartial trimmed means and 
the concepts of data depth. For the multivariate case Gordaliza (1991) introduced the 
procedure of impartial trimmed means, which was then extended by Cuesta and Fraiman 
(2006) for functional data to obtain resistant estimates of the center of a functional 
distribution. More recently, several proposals considering trimming procedures have 
been introduced, among them Doha ct al (2007) and Cuevas et al (2008). A number 
of depth definitions have been introduced for multivariate data, the best known being 
the Tukey Depth (Tukcy, 1975) and the Simplicial Depth (Liu, 1988) and an interesting 
review of this subject is provided by Scrfiing (2006). For the case of functional data, 
several proposals have recently been made, including Fraiman and Muniz (2001), Cuevas 
and Fraiman (2009), Gervini (2010), among others. 

Diagnostic and automatic robust procedures. Two complementary approaches 
were developed to achieve robustness. Broadly speaking, the different procedures used 
to flag outliers are known as diagnostic procedures, while the term 'robustness' refers to 
statistical procedures that are insensitive to the effect of outliers. As observed by Serfiing 
(2006), for each depth function one can find a function associated with outlyingness. 
Hence, several of the foregoing procedures can help in the detection of outliers, such as 
the proposals of Fraiman and Muniz (2001) and Gervini (2010). Febrero et al (2008) 
introduced a depth measure for functional data that they used for the identification of 
outliers. 

Many procedures for the identification of outliers have been introduced, among them 
Rousseeuw and Van Zomeren (1990), Pena and Prieto (2001) and Febrero et al (2007). 

In robustness, a class of estimators can be defined in general in terms of the optimiza- 
tion of a specified target functional for the population version of the estimate. Maronna 
et al (2006), in Chapter 6, presented a detailed review of the different estimates of 
location and scatter for the multivariate case, describing the properties of each. 

In diagnostics, the problem lies in the identification of sub-samples whose deletion 
maximally changes a statistic of interest as measured by an appropriate target function. 

The diagnostic procedure straightforward for one dimensional data, but becomes 
much more difficult for multivariate data. This is also the case for automatic robust 
procedures. 

Adaptive estimates. In the early 1970's, some adaptive estimates were proposed 
for the location model. The word adaptive implied that the estimate adjusted itself 
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in some way to the data concerned. The simplest idea was to consider an M-estimate 
using Ruber's %1)k function, where the parameter K (that compromises robustness and 
efficiency) depends on the data, i.e. K — K{X-i, ... , X^) (see for instance Yohai, 1974). 
Roughly speaking, if there are no outliers on the sample, the estimate would tend to 
choose large values of K, while choosing smaller values of K correspond to samples that 
have outliers. These interesting ideas have been neglected ever since. 

Neighborhoods. The question arises whether full neighborhoods should be used 
for some distances between probabilities, or contamination neighborhood should. As 
mentioned above, we expect robust procedures to be stable under a small deviation 
from the model, i.e., when the underlying true distribution P lies in the neighborhood 
of the assumed model Pq- Some descriptions of neighborhoods are therefore required. We 
consider two types, the first being, V^^d — {P '■ D{P, Pq) < e}, where D is an appropriate 
distance between probabilities, derived from, for example Prohorov, Kolgomorov or 
Wasserstein metrics. The second type is of the form — {P : P — (1 — e)Po + eQ}, 
where Q is an arbitrary probability measure. Typically, Ve C Ve,D, which is the case 
for Prohorov or Kolmogorov (total variation) distances, for example, is particularly 
appealing, however. It may be perceived that with probability (1 — e), an observation 
of the sample comes from a distribution Pq, while for probability e it comes from Q. 
Recently, Alqallaf et al (2009) considered non standard data contamination models, such 
as componentwise outliers. 

Several notions and measures of robustness have been defined to provide a detailed 
description of the robustness characteristics of an estimate. The main ideas are: 

• Solving minimax problems in a neighborhood of the "true" model. The first ap- 
proach to formalize the robustness problem was that of Ruber's (1964, 1981) mini- 
max theory, where the statistical problem is viewed as a game between the Nature 
(which chooses a distribution in the neighborhood of the model) and the statisti- 
cian (who chooses a statistical procedure in a given class). The statistician achieves 
robustness by constructing a minimax procedure that minimizes a lost criterion 
in which the worst possible distribution in the neighborhood is considered. 

• Qualitative robustness. The notion of qualitative robustness was introduced by 
Rampel (1968) in terms of the equicontinuity of the distribution of the estimate 
with respect to the Prohorov distance between probabilities. 

• Measures of robustness. We herein briefly describe the most important measures 
of robustness. The influence curve is the inflnitesimal approach was introduced 
by Hampcl (1968) within the framework of robust estimation. The flnite sample 
breakdown point (introduced by Donoho and Ruber, 1983) of an estimate is the 
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maximum number of observations that can be replaced arbitrarily while the es- 
timate remains bounded (or far away from the boundary of the parameter space 
when the parameter space is compact). The maximal asymptotic bias as a func- 
tion of the contamination fraction, measures the worst behavior of an estimate 
under a fixed proportion of contamination. 

Equivariance properties. We expect that any reasonable estimate would only 
depend on the data, and not on the coordinates axis nor on the scale considered. In the 
one dimensional problem, the equivariance properties are just the location and the scale 
equivariance. In the multivariate case, this corresponds to the location equivariance, 
the orthogonal equivariance, and the global scale equivariance. These properties are 
still sensible in the infinite dimensional case. However, in the finite dimensional case, 
the aflinc equivariance properties often also required. However, this last condition is far 
from mandatory, and could also be a bad idea. For instance, for contours of multivari- 
ate quantiles, the requirement of affine equivariance implies that the contours will be 
convex sets, losing the important features shown by the non-convex versions that are or- 
thogonal, translation and scale equi variant but not affine equivariant. We consider that 
the affine equivariance requirement is inadequate for both high and infinite dimensional 
data. Alqallaf et al (2009) proved that standard high-breakdown affine equivariant esti- 
mators propagate outliers under componentwise contamination when the dimension d is 
high. There are many examples of estimates of location and scatter that are not affine 
equivariant but have many appealing properties (see for instance Maronna and Zamar 
(2002) or Praiman and Pateiro-Lopez (2011)). 

Computational burden. Most of the robust estimates for multivariate data, par- 
ticularly those that have improved properties of robustness, are very expensive from a 
computational point of view and, their computation is unfeasible for high dimensional 
data. This is also the case with some of the data depth based estimates. In the multivari- 
ate case, several attempts have been made in recent years to overcome these problems, 
by considering resampling algorithms on the initial candidates that allow a substan- 
tial reduction in the number of candidates required to obtain a good approximation to 
the optimal solution (see for instance Van Aelst and Rousseeuw, 2009 or Salibian and 
Yohai, 2006). In the case of high dimensional or functional data, it is essential that the 
estimates can be computed in a reasonable time. 

Random projections. The idea of using random projections has been already 
used by several authors in many different contexts. An appealing theoretical framework 
for the use of random projections can be found in Cuesta, Praiman and Ransford (2006, 
2007). 

We herein propose a new robust estimation procedure based on random projections 
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that is data-adaptive, produces automatic robust estimates and combines ideas from 
the diagnostic framework. In addition it has very low cost when apphed to high or 
infinite dimensional data. The method yields robust estimates that are location and scale 
equivariant and are also invariant under unitary operators (orthogonal transformations 
in the finite dimensional case). 

The remainder of the paper is organized as follows. In Section 2 we present our 
general method, and we exemplify with location and scatter estimates. In Section 3 we 
provide some asymptotic results. Section 4 is devoted to analysis of the performance 
of the procedure on simulations of the finite dimensional case and also for the case of 
functional data. In Section 5 we analyze an example using real data. Section 6 includes 
some final remarks. 

2 Our setup 

We start by fixing some notation. Let EI be a real separable Hilbert space, with inner 
product < ■, ■ >, and the induced norm ||-||h=: ||"||- X denote a random element in EI 
with distribution P in a contamination neighborhood of Pq, our "target distribution". 

The data are given by a random sample Xi, . . . , Xn of iid random elements with the 
same distribution as X. 

We are interested in estimating different parameters, including the mean, the covari- 
ance operator, and the principal components of a random element Xq with distribution 
Po based on the sample Xi, . . . , The problem hes in the fact that the distribution 
of the data is given by P rather than Pq. 

We assume that E(Xo) = /io is finite and that the covariance operator Sx„, is 
positive definite when EI is finite dimensional and compact and self-adjoint if EI is infinite 
dimensional. 

Our aim is to introduce a general method of obtaining consistent robust estimates in 
this framework that is both suitable for different problems and computationally feasible. 

If we were able to select the sub-sample Xi^ , . . . , Xi^ of those random elements that 
have distribution Pq, from the sample Xi, . . . ,X„ , the problem becomes trivial. We 
may apply usual "non-robust" estimates to the "good data" . 

We propose a trimming scheme based on random projections to select a sub-sample 
Xi^, . . . , Xi^, which we hereafter denote the RT-procedure. The maximum number of 
observations to be trimmed is a (0 < a < 0.5). The goal "is to discard only those 
observations that are far apart from the bulk of the data" . Our method is adaptive in 
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that it stops searching for outhers in a data dependent way. 

In the presence of outhers that he far away from the bulk of the data, there are 
always several directions m, = 1 (a set of positive probability on the unit sphere with 
respect to the induced probability measure) where on the projected data, {< Xi,u > 
,...,< Xn, u >}, the outliers are distant from the majority of the observations. If an 
observation is an outlier this phenomenon is observed in many directions. Therefore, 
the maximal gap of the one dimensional projected data, will be large, or more precisely, 
larger than expected. We recall that the maximal gap is the maximum distance between 
two contiguous observations. On the other hand, those observations that are deep in 
the bulk of the data set also have deep one dimensional projections in every direction. 

However, we may have a shadow cone, where the projected outliers are "hidden" for 
some directions u, as shown on Figure [T| 

We seek to define a pruning criterion that combines these two facts. We consider two 
parameters, the first being a, a G [0, 0.5], the maximal trimming rate, the second being, 
maxiter, the maximal number of random directions that we generate before deciding 
that all the outliers have been removed. 




Figure 1: The projections on the direction of the y axis are useful to detect the outliers, 
the projections on the direction oi y = —x are on the shadow cone. 

The subsample selection algorithm. 

The following procedure should be repeated either \na\ times (where \_k\ represents 
the largest integer smaller than or equal to k) or until the number of random directions 
generated, numdir, reaches maxiter, i.e. while numdir < maxiter. 

• Set numdir = 1. 

• Generate a direction h at random according to a non-degenerate gaussian measure 
u . 

• Project the data on that direction, Y/^ =< Xi, h >. 
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• Sort the projected observations in ascending order Y^^-^ < • • • < Y^y 

• Consider the maximum gap gn = maxj=i_...^„_i |y(tfi) ~ 

• li Qn > Cd, '\ trim the observation whose one-dimensional projection is most distant 
from the median of the projected observations, i.e. Xi such that i — arg max{|y^''— 
y^l for i — 1, . . . ,n} where Y^ — median (y^^ . . . , Y^^^ . 

• Else if gn < Cd, set numdir= numdir+ 1. 

We denote to be 7 the effective trimming rate, i.e., the actual rate of observations 
that have been reduced. 

The value of the threshold q on f should be related to the limiting behavior of 
the largest spacing. Deheuvels (1984), characterized it in terms of the local behavior 
of the density of the observations (in our case the one dimensional projections) in the 
neighborhood of the point at which it reaches its minimum value. According to Theorem 
4, he proved that if the distribution has compact support and the density is bounded 
away from zero, i.e. there exits Xq such that f{x) > /(xq) > for every Xq G Support{f), 
and if f has first derivative then, liminf„_!.oo = — 1. 

J ' ft— log(log(n)) 

This result suggest to take (for sufficiently large values of n) q = k ^°s(")~j°s(iog(")) ^ 
where k is a suitable constant greater than 1. There are many univariate distributions 
that do not have compact support, among which the normal distribution is probably 
the best known. In practice all the observable data sets have bounded support, hence 
our consideration of this expression might enable us to determine a useful threshold. 

Remcirk 1. If H is a finite dimensional space the random directions can be generated 
according to an uniform distribution on the unit sphere. 

Finally, we compute the ordinary (non-robust estimate) of our target parameters. 
For instance, the ML mean and covariance estimates are computed only with those 
observations that have not been trimmed. This implies, that if we assign weight w{Xi) — 
Wi — to those observations that have been pruned and w{Xi) — Wi — 1 to the 
remainder of the observations, then the mean and the covariance function estimates are 
given by 



E(s,i) = 



En •> 
i=l 

ELl W^ iX^ (S) - l2) (X, (t) - 



En 

respectively, where Y^^=i Wi — n— \_^n\ . 
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EquivEiriance properties 

The estimates are equivariant under translations, re-scaling and orthogonal transfor- 
mations. Indeed, it follows from the fact that the weights assigned to the observations 
are not affected by these operations, because we know that the usual mean and covari- 
ance estimates are not affected by them. 

It is clear that a shift and re-scaling with a positive constant will not affect the order 
statistics of the projections, and so the trimmed observations will be the same. Also, if 
T is an orthogonal transformation, then {Xi, h) — {TXi, Th) (because it is an isometry), 
which means that as the projections do not change, so neither do the weights. 

3 Some comments on the asymptotic behavior 

Let Xi, . . . ,Xn be lid random elements with the same distribution as X defined on 
(n, vA), and hi, . . . ,hk iid random directions on the unit sphere, independent of Xi, . . . , X, 
and defined on {Q', A'). After performing the algorithm we obtain a subset of random el- 
ements Xi^ , • • • , Xi^ , where m < n is random, and each X^ [cu, cu') ,j = 1 , . . . , m depends 
on the set of random elements {Xi, . . . , X„, hi, . . . , hk}. 

The distribution oi X is P — {1 — e)Po + eQ, where Pq is the core distribution, Q is 
an arbitrary contamination and e e [0,0.5). 

Let Pm{uj,uj') be the empirical measure associated with the set of random elements 
{Xi^, . . . , Xi^}. 

We wish to identify the conditions under which Pm converges weakly towards Pq, 
which would imply that for any continuous functional T{P), the estimate defined as 
T{Pm) will converge to T{Pq), i.e., we obtain almost sure consistency without assuming 
any kind of symmetry. 

Let Xq be a random element with distribution Pq. As mentioned above we assume 
that E(||Xop) is finite and the covariance operator Sxq, is positive definite when EI is 
finite dimensional and compact and self-adjoint if EI is infinite dimensional. Without 
any loss of generality we also assume that E{Xq) = 0. 

We now introduce the following hypotheses: 

Al. P = (1 - e)Po + eQ, for some < e < 0.5. 

A2.- Xq has a bounded support 5*0, and Q is such that its support Sq verifies that 
d{So,SQ)>6>(}. 

This means that the outliers are far from the bulk of the data. 
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A3.-For all h, \\h\\ = 1, Xo{h) :=< Xq, h > has a derivable density fh that fulfils 

inf |A(t)|>r/, (3.1) 
for some rj > 0, where Sh represents for the support of fh- 

Let {Xjj^, . . . , Xj^} be the subset of {Xi, . . . , X„} with Pq, Pr being the corresponding 
empirical distribution, and /„ = {ji, . . . ,jr}- 

Because of Al, we may assume that Xj = ZiXiQ + (1 — Zi)XiQ, i = 1, . . . ,n, where 
Xio, . . . , Xno are iid random elements with distribution Pq, Xiq, . . . , X„q are iid random 
elements with distribution Q and Zi, . . . , Zn are iid random variables with distribution 
Bernoulli(e), being independent the three sets of random elements. 

The aim is to show that du{Pm,Po) a.s., where du stands for the Prohorov 
distance when n — )■ oo, and k = k{n) — oo at an appropriate rate. This would imply 
that for any continuous functional T{P), the estimate defined as Ti^P^) would converge 
to T{Pq). Moreover, we show that in the finite dimensional case, the final estimates 
will attain "full-efficiency" , meaning that the output of the algorithm will be the set of 
random elements with distribution Pq. In the infinite dimensional case, an additional 
assumption will be required in order to attain the "full-efficiency". Otherwise, under 
some symmetry conditions on the distribution Pq, regardless of the distribution Q, the 
estimate will still be consistent. 

A sketch of the proof may be made on the following lines: 

First we prove that since e < 0.5, for a sufficiently large n we have that r > n — r 

a.s. 

Since r = this is equivalent to show that Yl'i=i Zi > n — J2i=i ^-S- for 

sufficiently large n, which follows because by the Hoeffding inequality we have that 
^(Er=i Zi>1)< e-2"(o-5-^)^^ and e < 0.5. 

Second, we show that given Xi with distribution Pq and any Xj with distribution 
Q, for a; in a subset of probability 1 on Q, there exists Si{u) > and a set of strictly 
positive probability of directions on Q' for which 



\ < X„h> - < Xj,h> \ > 6i{uj). (3.2) 

Indeed, by assumption A2 , we have that < Xi — Xj,Xi — Xj » 6 for all j such 
that Xj has distribution Q, with probability 1. There is therefore a cone of directions 



h where (3.2) holds 



This implies that in the presence of outliers, for sufficiently large values of k we can 
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find as many directions as are required for which the maximal spacing of the projected 
data is be larger than 6i. 

Recall that the value of the threshold q on f is related to the limiting behavior of 
the largest spacing. Deheuvels (1984) characterized it in terms of the local behavior 
of the density of the observations (in our case the one dimensional projections) in a 
neighborhood of the point where it attains its minimum spacing. On Theorem 4, he 
proves that if the distribution has compact support and the density is bounded away 
from zero, i.e. there exits Xq such that f{x) > /(xq) > for every x G Support{f), and 
if / has first derivative then 

1 . rngnfixo) -hg{n) n^„/(xo) - log(n) 
— 1 = hmmt — - — -— < hmsup — - — -— = 1, a.s., 

n^oo log(log(n)) n^oo log(log(?2)) 



thus, the maximal spacing behaves like which converges to zero 

sufficiently quickly. 

Our threshold, which is a constant, is defined as, Cd = where k IS a 

suitable constant greater than 1. 

Since r > n — r, the median of the projected data must lie between miujg/^ < Xi, h > 
and maxig/^ < Xi,h >. 

The foregoing statements and assumption A3 imply that for directions h fulfilling 



(3.2) the maximal spacing will be achieved between some < Xi, h > and < Xj, h > for 
Xi with distribution Pq and Xj with distribution Q or for both Xi, Xj with distribution 
Q. In both cases an observation with distribution Q will be deleted. Therefore, for 
sufficiently large values of k, all outliers will be deleted. 

The foregoing proofs are insufficient, however. We already know from the previous 
assertion that for large enough values of k the set {Xi-^, . . . , Xi^} C {Xj^^, . . . , Xj^}, but 
we must still prove that the set Xj^ , . . . , Xi^ "behaves as an iid sample of Pq" , and that 



m —7- oo. To achieve this goal we need conditions more stringent than (3.1) to hold (see 



(Q below). 

For each fixed direction hi, let gr{hi) denote the maximal spacing of the set {< 
Xj,hi >: j G /„}, with cardinal r. From assumption A3, and the fact that Cd > 
iog{r)+iogiiog(r)) j^ave that P{gr{hi) > Cd i.o.) = 0, and thus, gr{hi) < Cd for r > ri{u) 
for almost all w G fi. 

If that holds, then 

P(max qr(hi) > Cd i.o.) = 0, (3.3) 

i<i<k 

for a sequence k = k{n) — )■ oo, the procedure will end with exactly all observations on 
the set {Xji, . . . ,Xj^}, and the estimate will attain full efficiency. 
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In the finite dimensional case, (3.3) is a consequence of assumption A3 (see Propo- 
sition [T] below), but this is not the case in the infinite dimensional setting. 

Otherwise, we would need to require some symmetry assumptions of the pair (T, Pq). 
More precisely, we would require that 

• Xq has a symmetric distribution around fi := E(Xo), i.e. Xq — fi and fi — Xq 
have the same distribution, and that T{Pq) = T{tk{Pq)), where tk{Po) is the 
distribution of a truncation of the random variable Xq, i.e. tk{Xq) = XqIb(^^k)- 

• The density fh(t) is a decreasing function around < fi,h > for all h, 

to obtain consistency. Indeed, it is be a consequence of the symmetry and Theorem 4 
in Deheuvels (1984), where it is shown that for any 6 > 0, there exists almost surely 
a value of such that, for any n > N, the maximal spacing interval is included in 
(xo - 6,xo + 6). 

Proposition 1. Assume that Sq C M'' the support of Pq is a compact connected set with 
Lebesgue boundary set null, i.e. {OSqI = 0, and that there exists rj > such that the 
density of Xq fulfills f > rj on Sq. Given a sample {Xi, . . . ,X„} of iid random vectors 
with distribution Pq, let 



An{Po) = sup {3 a; e M'^ with x + r5(0, 1) C 5o \ {Xi, . . . , 

relR 

be the maximal multivariate spacing. We then have that 

i) There exists a compact set Si, Sq C Si C M.'^, \Si\ = l/rj, \dSi \ = such that 

P(A„(Po) > c) < P(A„,(Pi) > c), VoO, 

where A„(Pi) is the maximal spacing corresponding to a uniform random sample 
Ui,...,Un on Si. 

a) P(maxi<;<fc (7r(/ii) > c) < P(A„(Pi) > c), for all c > 0. 

Hi) 

nVn — logn — {d — l)loglogn — loga — f/, 

where Vn = A„(Pi)^ ^^]^J^' ^ extreme value distribution P{U < u) = 

" and a and is a known constant. Moreover, 

, r ■ e ^^ri - logn nVn - logn 

a — \ = limmf — ; — ; < hmsup — - — = a — 1, a.s. 

n-5-oo loglogn n^oo loglogn 
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Proof. The proof of i) is based on a coupling argument. We first demonstrate it for the 
simple case of d = 1. It may be assumed without loss of generality that 5*0 = [0,1], 
and f{x) > r] W x E [0,1]. Let g be the uniform distribution on 5*1 = [0,^], and 
Ui,...,Un iid random variables uniformly distributed on [0,1]. Let F and G stand 
for the cumulative distribution functions, and define Xi = F^^(Ui), Yi = G^^{Ui), 
iid samples with distribution F and G respectively. Since for any pair i,j, F{Xi) = 
G{Xi), F{Xj) = G{Xj), and the derivative of F is uniformly greater than that of G, 
we have that |Xj — Xj| < — 1^1, which concludes the proof. 

The argument for d > 1 is a little more involved and can be found in Ferrari et al. 
(2011). 



ii) Let X^^\x) be the nearest neighbor of x among Xi, . . . , X„. For any h, \\h\\ = 1, 
and for any pair i,j we have that \ < h, X^ > — < h, Xj > \ < ||Xj — Xj\\. Therefore, 

max(?„(/iz) < A„(Po), (3.4) 



and the conclusion follows from (3.4) and i). 

iii) These results were obtained by Janson (1987, Theoreml). 

□ 



4 Simulations 



In this Section we discuss the performance of the location, correlation estimates proposed 



herein, as well as robust principal components estimates. In Subsection 4.1| we analyze 
the finite dimensional case and the infinite dimensional case is considered in Subsection 



4.2 In both cases we compare the RT-estimate with other classical robust estimates 



and with a benchmark, which is defined below. 



4.1 Finite Dimensional Case 

We report on the results of a Monte Carlo study, the aim of which was to compare 
the performance of the Random Trimming (RT) estimates for location and the correla- 
tion matrix. It is well known, that the distributions of the original data set P and the 
trimmed data set P are different, and if P is elliptical the center fi, coincides with the 
center Jl of P. However, the covariance estimate S of P and the covariance estimate S 
of P have the same shape but differ by a constant c. In order to avoid this problem. 
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we estimated the correlation matrix for every case that did not depend on constants. 
Another approach could have been the computation the minimum and maximum eigen- 
values or the matrix condition number. We herein compare our estimates with two 
different estimates and a benchmark, which we describe below. 

a) The Minimum Volume Ellipsoid (MVE) estimate. The MVE estimate was intro- 
duced by Rousseeuw (1987); among all the ellipsoids {x : d{x,fi,T,) < c} that contain 
at least half the data points, the one given by the MVE estimate has the minimum 
volume. Since the computation of this estimate numerically is very expensive unless p 
and n are small, a sub-sampling procedure as described by Maronna et al. (2006) is 
carried out, we denote these estimates (MVEF). We consider these estimates because, 
despite being very inefficient, they achieve the highest possible breakdown point, if the 
points are in a general position (this means that no hyperplane contains more than p 
points). In addition, these estimates are the initial estimates usually considered for the 
computation of S-estimates, and a number of statistical programs, including S-plus, R 
and SAS, have internal routines to compute them. In these cases the covariance matrix 
estimation also differs from that of the population by constant, then we compute the 
correlation matrix. 

b) As a benchmark, we compute the mean and correlation classical estimates from 
the data sets without outliers, and we denote these trick estimates. It is clear that this 
benchmark attains the best possible behavior. 

c) Recently, Gervini (2010) introduced a trimmed estimate for location and scatter, 
mainly with a view to handling functional data but it was also defined for general Hilbert 
spaces. Again the idea was to trim the observations that are far away from the bulk of 
the data set. For each observation Xi he defines the a- radius, r,, a G [0, 0.5], to be the 
radius of the smallest ball centered on Xi that contains 100q;% of the observations. In 
the second stage he trimmed the 100/3%, /3 G [0,0.5], of the observations with higher rj. 
We denote these Inter-distances Trimmed Estimates (IT). 

Figure [2] shows an example where the performances of the RT and IT estimates are 
very different. The data are displayed in Figure |2] (a); the grey points correspond to 
the central distribution and the green ones are the outliers. The contamination rate is 
20%. In Figures |2] (b) and (d) the red dots correspond to the observations trimmed by 
considering the IT procedure where the trimming rates were 20 and 40% respectively. In 
the first case it was only the outliers that were trimmed, while in the second case some 
observations from the core distribution were also pruned. In Figures [2] (c) and (d) the 
pink dots correspond to the observations trimmed by the RT procedure for trimming 
rates of 20 and 40% respectively. In both cases only the outliers were trimmed. 

We consider samples of size n = 100 on dimensions p = 10, 20 and 40. In all cases a 
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0.20-IT 



0.20-RT 



m ) 



Figure 2: Scatter-plot of the data with the outhers in green (a), trimmed observations 
by IT 20%, RT 20%, IT 40% and RT 40% ((b),(c),(d) and (e)) are shown in a different 
color. 

fraction (1 — e) of the observations Xj were taken from a multivariate normal distribution, 
and the remaining observations are point wise outliers xq . Because of the equivariance 
properties of all the estimates considered in this study, the normal observations were 
taken with a mean of zero and the identity covariance matrix. Moreover, the values of 
xo were taken to have the form xo= (xq, 0, . . . , 0) and xq took values of 3, 7 and 11. 
For the case of the RT-estimate, a direction was considered to be suitable for trimming 
if the length of the one dimensional gap is bigger than a constant as established on 
Deheuvels (1984) we took k = 3 and /(xq) = 0.0044, which corresponds to ip{3), where 
if is the density of a standard normal distribution. The search is carried out in less 
than 100 random directions. In the case of the IT-estimates we consider a = 0.5, as 
suggested by Gervini (2010). In both cases the trimming proportions were 10, 30 and 
50% of the data, for the RT-estimates this was the maximum allowable trimming and 
for the IT-estimates this is the effective pruning rate. 

First, we compared the location estimates, and the results are given on Tables [T] 
and |2j In every case the RT-estimates outperform MVE and MVEF. The behavior 
of the IT-estimate is remarkable, specially if the outliers are far away from the bulk 
of the data. If 10% of the observations are outliers and the same amount of data are 
pruned there are several configurations where the MSE of the trick estimate is achieved. 
The RT-estimates also perform very well, in that they are better than the IT-estimate 
if Xo = 3, but not if xq = 7 or 11. It is important to note that the MSEs for the 
RT-estimates do not change with the trimming rate, due to the application of a wise 
trimming strategy. 

Tables IH] and m show the MSEs for the estimated correlation matrices. In this case the 
behavior of the RT-estimates is noteworthily, achieving the MSE of the trick estimate, 
and in almost every circumstance, this method outperforms the others. 

Finally, we briefly discuss the computation time. It is well known that MVE is 
very slow for high dimensional data sets. In our simulations, the MVE is on average 
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Table 1: MSE for the location estimates for 10% contamination 
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Table 2: MSE for the location estimates for 20% contamination 
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Table 3: MSE for the correlation matrix estimates for 10% contamination 



between 1300 and 2000 times slower that the trick estimate. The MVEF method takes 
on average half the time that MVE takes to compute the estimates; nevertheless it is a 
very slow procedure. Computation times for the IT and RT estimates are very close to 
the benchmark, particularly for higher dimension. For p = 40 the two methods are only 
30% (respectively 25%) slower than the trick estimate on average. For dimension 20 
they take less than twice the time of the trick estimate to compute the estimations and 
for dimension 10 they are four times slower. In all cases the RT-estimates are slightly 
faster than the IT-estimates. 



4.2 Functional Data 

In this Section we study the performance of the RT-estimates of the location and corre- 
lation functions for the case of functional data in the presence of outliers. We generate 
100 functions, 90 of which follow the central model, X (t) = 30t(l - tf/"^ + e (t), for 
t G [0, 1]. and the remaining 10 observations are distributed under one of the following 
contamination models 

Case A. X (t) = 30 (1 - t) t^/^ + e{t), for t G [0, 1] . The outhers have a different 
shape from the functions generated by the core distribution. 

Case B. X (t) = 30t(l - tf/^ + 2 + e (t), ior t e [0, 1] . In these case we consider 
level shift outhers. 
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Table 4: MSE for the correlation matrix estimates for 20% contamination 
Case C. 



X{t) 



30t(l - tf/^ + e{t) te [0, 0.4) U (0.6, 1] 

30t(l -t)3/2 + 2 + £(t) tG [0.4,0.6] 



In the last case the outliers have the same distribution as the observations generated 
under the central model, except in the interval [0.4, 0.6] where they are shifted as in the 
previous case. In every case the error of the process e (t) follows an Ornstein-Uhlenbeck 
law, i.e., a Gaussian process with a mean of zero and a covariance function given by, 
p(s,t) = 0.3exp (^-^) , for s, t G [0, 1] . 

In Figure [3] we display the three models, the grey functions are generated under the 
central model while the red ones are the outliers. 






Figure 3: Each figure exhibits one of the contamination distributions. 



For each model we performed 500 replicates where a maximum of either 20, 30 or 
40% of the observations were pruned. Each function was observed at equidistant points 
on the interval [0, 1], and the distance between two observations was 0.01. We report 
the mean number of outliers pruned in table |5| The procedure was very successful at 
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Case A 



Case B 



Case C 



Figure 4: In each figure the pruned functions are dark grey, the theoretical central curve 
is blue and the estimated central curve is red. 

Trimmed proportion 
0.4 0.3 0.2 
Case A 8.08 8.06 8.11 
Case B 9.992 9.992 9.994 
Case C 6.43 6.43 6.42 

Table 5: Mean number of outliers detected for each model and trimming proportion. 

detecting outliers in Cases A and B, however these results changed significantly for Case 
C where on average more than three outliers are not pruned. Moreover, in Table [6] we 
show the proportion of observations that are effectively trimmed for each model and 
trimming proportion and, it can be seen that this proportion does not change as the 
trimming rate increases. 

In order to analyze the performance of the central and correlation estimates, we 
consider the L2-error {L2E). Let fi{t) be the estimate of /i(t) and E(t, s) be the estimate 
of s), where t, s : h, . . . ,1^, then L2E{fi) = ^ Y.f^^{fi{U) - fi(ti)y, 

and L^EiE) = E5=i(S(t., 5,) " s,))^. 

Table [7] exhibits the L2-error for the location estimates. The first column shows the 
results for the trick estimate: we consider these results to be the benchmark. From the 
2"''^ to the 4*^^ columns the results for the RT-estimates are displayed for three different 
trimming levels bounds. Finally the results for the IT-estimates proposal with a = 0.5 
and the same pruning proportion as in our case with hard rejection weights are given 



Trimmed proportion 
0.4 0.3 0.2 
Case A 12.11 12.10 12.20 
Case B 13.75 13.75 13.84 
Case C 11.83 11.88 11.96 

Table 6: Mean effective trimming rate for each model and trimming proportion. 
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Trick RT IT 

0.4 0.3 0.2 0.4 0.3 0.2 



Case A 0.0513 0.0637 0.0640 0.0636 0.0725 0.0682 0.0640 
Case B 0.0513 0.0595 0.0595 0.0596 0.0722 0.0677 0.0630 
Case C 0.0513 0.0715 0.0714 0.0711 0.0705 0.0670 0.0628 

Table 7: Mean square errors for the location function estimate for each model and 
trimming proportion for the benchmark (Trick), RT-estimates and IT-estimates. 

Trick RT IT 

0.4 0.3 0.2 0.4 0.3 0.2 

Case A 0.0779 0.1159 0.1160 0.1153 0.1830 0.1544 0.1262 

Case B 0.0779 0.0957 0.0953 0.0960 0.1833 0.1543 0.1213 

Case C 0.0779 0.1213 0.1209 0.1213 0.1814 0.1525 0.1240 

Table 8: Mean square errors for the correlation function estimate for each model and 
trimming proportion for the benchmark (Trick), RT-estimates and IT-estimates. 

on the last three columns. The results for the RT-estimates do not change much for 
the different estimates. This is because the effective trimming rate and the number of 
outliers pruned in each replicate are practically the same. IT-estimates show a better 
performance when the trimming rate is lower, because the contamination proportion is 
even smaller than the trimming rate. For Models A and B our estimates outperform 
the IT-estimates, but in Case C the IT-estimates are better because our detection of 
outliers is less effective. Nevertheless, the performances of the two estimates are very 
similar and close to the trick estimates. Finally, Table [8] shows the same results for 
the correlation matrix estimates. The comparison among the estimates shows strong 
similarities, the main difference being that for Case C our estimates outperform the 
IT-estimates. 

The matlab codes for computing the RT-estimate are included as supplemental files. 
4.3 Other Classical Statistical Problems 

In this Section demonstrate the usefulness of the trimming procedure for other classical 
statistical problems. The best known dimension reduction technique is that of the 
principal components, the aim of which is to find a linear combination of the original 
variables that are uncorrelated and that explain as much of the variability as possible. 
We consider the same models as in Section 4.2, and trim up to 40% of the data. The 
results are plotted in Figure |5} in which it can be seen that in every case all the outliers 
were trimmed. It can be seen that for the cases where the data set is complete (without 
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trimming) the shape of the first principal component weight function is similar to the 
mean of the contamination. In the three cases, the principal components for the trimmed 
data sets are similar to the function corresponding to the data set without the outliers. 




7^ 



Figure 5: In each figure the weight functions for the first principal components are 
represented for each contamination model. The blue function is for the complete data 
set, the yellow function is for the data without the outliers and the red one is for the 
RT data. 



5 A real data example 

We now consider the data set of NOx emissions collected in Barcelona, Spain, during 
the first semester of 2005. The data consists of 127 curves measured 24 times a day, 
and were first analyzed by Febrero et al. (2007, 2008), in both cases they carried out 
procedures to detect outliers. There are 12 days for which observations are missing, so 
they considered the restricted data set conformed by 115 functions. In their first study 
they found a two cluster structure, corresponding to working days and nonworking days 
(i.e., Saturdays, Sundays and hohdays). They also provided location, scale estimates 
and set confidence for the NOx data set. 

Furthermore, they introduced a distance based procedure to detect outliers. They 
apphed the procedure to the entire set of 115 observations and also to each of the two 
clusters separately. Febrero et al. (2008) proposed a depth-based criterion for outlier 
identification, and by using different depth notions on the same data set they proceeded 
to identify the outliers. In this case they did not analyze the whole data set, but instead 
they took the subsets of working days and non-working days separately. 

The outliers detected by Febrero et al. for the non-working days in both papers are 
the same (March lO*'* and April 30*^), while the outliers for the working days do not 
strictly coincide: in their second paper they detect two outliers (March 18*'' and April 
29*'') while in their first they also detect March 11*'' to be an outlier. 

In this Section we apply our procedure to detect the outliers in the data set and 
compare the results with those obtained by Febrero et al. (2007, 2008). 
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When trimming up to 5% of the observations, the results obtained for the working 
and non- working days coincide with those obtained by Febrero et aL (2007). Figure [6] 
(b) and (c), shows the observations corresponding to the working and non- working days 
respectively. 

In both cases the solid red observations are the outliers detected by the three proce- 
dures and the dashed red line is the new outlier detected only by Febrero el al. (2008). 

The analysis of the complete data set can only be compared with the results obtained 
on Febrero et al. (2007). We detect three outliers, two of which coincide with those 
found by Febrero el al. (2007) (March IS*'* and April 29*''), and these observations 
are shown in Figure [6] (a) as a solid red line, Febrero et al. (2007) classify as outliers 
two other observations that we do not (March 11^^ and May2"°'), which are plotted as 
dashed red lines and we find in addition an outlier that they did not, which is indicated 
as a solid blue line and corresponds to March 16*'*. The NOx level measured during 
the dawn and morning of March 16*'^ was very high, however the NOx level during the 
afternoon is in the bulk of the data. But for March 11*'* and May2"^ the NOx levels 
during the afternoon are high compared with the others NOx levels measured at the 
same time, and in general these values are higher than the bulk of the data only for very 
short periods of time. 

400 
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Figure 6: The red lines stand for the outliers detected by Febrero et al. (2007) and by 
our procedure, the dashed red lines are the observations detected as outliers only by 
Febrero et al. (2007) and the solid blue lines are the observations detected as outliers 
only by our procedure, (a) Outlier detection for the complete data, (b) Outlier detection 
for the working days, (c) Outlier detection for the non-working days. 

6 Concluding Remarks 

We have herein presented a new robust estimation method based on random projections 
that yields robust estimates of location and scatter in general Hilbert spaces. This 
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procedure is data adaptive because it trims only the observations that are far apart 
from the bulk of the data. It is a particulary suitable procedure for high dimensional 
data and for functional data because is these estimates are very fast to compute. 
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